431 Class 25

Thomas E. Love, Ph.D.

2023-12-05

Getting Started

  • Some modeling with Favorite Movies from 2023-10-24.
    • Note: I downloaded our Google Sheet to an Excel .xlsx file.
  • Selecting an Outcome
  • How Many Predictors Can We Include?
  • Multiple and Single Imputation with mice
  • Using functions from ggmice and corrplot to visualize
  • K-fold Cross-Validation with caret package

Definitely not before you finish Quiz 2.

  • I wrote the quiz and sketch in 4.3.1.
  • After that, it’s up to you. If you use version 4.3.1 or later for your Project B, I’m fine. I’ll start caring about 4.3.2 in 432.
  • You will probably want to update your packages either way.
  • I have completed the upgrade to 4.3.2. for these slides, as you can see in the Session Information.

Today’s Packages

library(here)
library(readxl)
library(janitor); library(gt)
library(mosaic); library(patchwork)
library(naniar); library(mice); library(mitml)
library(car); library(GGally)
library(broom); library(xfun)
library(corrplot)  # NEW: plotting a correlation matrix
library(ggmice)    # NEW: visualizing missingness and imputations
library(caret)     # NEW: k-fold cross-validation
library(tidyverse)

theme_set(theme_bw())

Ingesting the data

movie_raw <- read_excel(here("c25/data/movies_2023-10-24.xlsx"),
                        na = c("", "NA")) |> # otherwise only "" is recognized
  clean_names() |>
  type.convert(as.is = FALSE) |>   # convert all characters to factors
  mutate(film_id = as.character(film_id), 
         film = as.character(film))

movies <- movie_raw |>
  select(film_id, imdb_pct10, fc_pctwins, rt_audiencescore, 
         ebert, box_off_mult, budget, metascore, bw_rating, imdb_oscars, 
         mentions, dr_love, gen_1, bacon_1, lang_1, 
         drama, comedy, adventure, action, romance, fantasy, 
         sci_fi, crime, thriller, animation, family, mystery, 
         biography, music, horror, musical, war, history,
         sport, western, film)

dim(movies)
[1] 201  36

Quick Check of Ingest

summary(movies)
   film_id            imdb_pct10      fc_pctwins rt_audiencescore
 Length:201         Min.   : 3.80   Min.   :24   Min.   :28.00   
 Class :character   1st Qu.:11.60   1st Qu.:42   1st Qu.:76.00   
 Mode  :character   Median :15.60   Median :52   Median :86.00   
                    Mean   :17.53   Mean   :51   Mean   :81.91   
                    3rd Qu.:22.20   3rd Qu.:60   3rd Qu.:92.00   
                    Max.   :55.00   Max.   :79   Max.   :98.00   
                                                                 
     ebert        box_off_mult         budget            metascore     
 Min.   :1.000   Min.   : 0.0013   Min.   :   200000   Min.   :  9.00  
 1st Qu.:2.875   1st Qu.: 2.6000   1st Qu.: 12000000   1st Qu.: 61.00  
 Median :3.500   Median : 4.7000   Median : 30000000   Median : 72.00  
 Mean   :3.190   Mean   : 8.5418   Mean   : 59242257   Mean   : 71.35  
 3rd Qu.:4.000   3rd Qu.: 9.3000   3rd Qu.: 90000000   3rd Qu.: 84.00  
 Max.   :4.000   Max.   :73.7000   Max.   :356000000   Max.   :100.00  
 NA's   :25      NA's   :20        NA's   :19          NA's   :10      
   bw_rating      imdb_oscars         mentions     dr_love   gen_1  
 Min.   :0.000   Min.   : 0.0000   Min.   :1.000   No :124   F: 45  
 1st Qu.:1.000   1st Qu.: 0.0000   1st Qu.:1.000   Yes: 77   M:156  
 Median :3.000   Median : 0.0000   Median :1.000                    
 Mean   :2.135   Mean   : 0.9849   Mean   :1.249                    
 3rd Qu.:3.000   3rd Qu.: 1.0000   3rd Qu.:1.000                    
 Max.   :3.000   Max.   :11.0000   Max.   :6.000                    
 NA's   :8       NA's   :2                                          
    bacon_1           lang_1        drama            comedy      
 Min.   :1.000   English :177   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2.000   Japanese:  7   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :2.000   Hindi   :  5   Median :1.0000   Median :0.0000  
 Mean   :1.886   Italian :  2   Mean   :0.5721   Mean   :0.3582  
 3rd Qu.:2.000   Arabic  :  1   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :3.000   ASL     :  1   Max.   :1.0000   Max.   :1.0000  
                 (Other) :  8                                    
   adventure          action          romance          fantasy      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.3333   Mean   :0.2537   Mean   :0.1692   Mean   :0.1393  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
                                                                    
     sci_fi           crime           thriller        animation      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.00000  
 Mean   :0.1244   Mean   :0.1045   Mean   :0.1045   Mean   :0.08955  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
                                                                     
     family           mystery         biography           music        
 Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :0.00000   Median :0.0000   Median :0.00000   Median :0.00000  
 Mean   :0.08955   Mean   :0.0597   Mean   :0.05473   Mean   :0.05473  
 3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  
                                                                       
     horror          musical             war            history       
 Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :0.0000   Median :0.00000   Median :0.0000   Median :0.00000  
 Mean   :0.0398   Mean   :0.02985   Mean   :0.0199   Mean   :0.01493  
 3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.00000  
                                                                      
     sport            western             film          
 Min.   :0.00000   Min.   :0.000000   Length:201        
 1st Qu.:0.00000   1st Qu.:0.000000   Class :character  
 Median :0.00000   Median :0.000000   Mode  :character  
 Mean   :0.01493   Mean   :0.004975                     
 3rd Qu.:0.00000   3rd Qu.:0.000000                     
 Max.   :1.00000   Max.   :1.000000                     
                                                        

Data Cleaning

  1. Let’s convert budget to express it in millions of US dollars
  2. lang_eng should be 1/0 for English (n = 177) vs. Non-English
movies <- movies |>
  mutate(budget = budget / 1000000,
         lang_eng = as.numeric(lang_1 == "English"))

favstats(~ budget, data = movies) |> gt() |> 
  fmt_number(columns = mean:sd, decimals = 2) 
min Q1 median Q3 max mean sd n missing
0.2 12 30 90 356 59.24 68.02 182 19
movies |> tabyl(lang_eng, lang_1) |> gt() 
lang_eng Arabic ASL Bengali Danish English French German Hindi Italian Japanese Mandarin Norwegian Persian Spanish
0 1 1 1 1 0 1 1 5 2 7 1 1 1 1
1 0 0 0 0 177 0 0 0 0 0 0 0 0 0

Which outcome shall we choose?

We’re interested in a percentage measure (0-100) addressing how beloved the movie is, according to an audience.

Variable NA Description
imdb_pct10 0 % of 10-star public ratings in IMDB as of 2023-09
fc_pctwins 0 % of matchups won on Flickchart as of 2023-10
rt_audiencescore 0 Rotten Tomatoes Audience Score (% Fresh) as of 2023-10

Which outcome shall we choose?

p1 <- ggplot(data = movies, aes(x = imdb_pct10)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, 
                 fill = "navy", col = "white") +
  scale_x_continuous(breaks = seq(0, 100, by = 10), limits = c(0, 100)) +
  stat_function(fun = dnorm, 
                args = list(mean = mean(movies$imdb_pct10),
                            sd = sd(movies$imdb_pct10)),
                col = "magenta", linewidth = 1.5)

p2 <- ggplot(data = movies, aes(x = fc_pctwins)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, 
                 fill = "navy", col = "white") +
  scale_x_continuous(breaks = seq(0, 100, by = 10), limits = c(0, 100)) +
  stat_function(fun = dnorm, 
                args = list(mean = mean(movies$fc_pctwins),
                            sd = sd(movies$fc_pctwins)),
                col = "magenta", linewidth = 1.5)

p3 <- ggplot(data = movies, aes(x = rt_audiencescore)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, 
                 fill = "navy", col = "white") +
  scale_x_continuous(breaks = seq(0, 100, by = 10), limits = c(0, 100)) +
  stat_function(fun = dnorm, 
                args = list(mean = mean(movies$rt_audiencescore),
                            sd = sd(movies$rt_audiencescore)),
                col = "magenta", linewidth = 1.5)

p1 / p2 / p3

Which outcome shall we choose?

Which outcome shall we choose?

df_stats(~ imdb_pct10 + fc_pctwins + rt_audiencescore, data = movies) |>
  gt() |> fmt_number(columns = min:sd, decimals = 1) 
response min Q1 median Q3 max mean sd n missing
imdb_pct10 3.8 11.6 15.6 22.2 55.0 17.5 9.1 201 0
fc_pctwins 24.0 42.0 52.0 60.0 79.0 51.0 12.6 201 0
rt_audiencescore 28.0 76.0 86.0 92.0 98.0 81.9 13.5 201 0

What is fc_pctwins?

  • Our variable is the % of its “matchups” that the movie wins.

flickchart.com

My Favorite Movie

Our Top 10 fc_pctwins scores

movies |> arrange(-fc_pctwins) |> select(fc_pctwins, film) |> head(10) |> gt()
fc_pctwins film
79 The Dark Knight
77 The Shawshank Redemption
77 Star Wars: Episode IV - A New Hope
77 Star Wars: Episode V - The Empire Strikes Back
76 Pulp Fiction
74 The Godfather
73 Inception
73 Raiders of the Lost Ark
71 Inglourious Basterds
71 Star Wars: Episode VI - Return of the Jedi

Our bottom 13 fc_pctwins scores

movies |> arrange(fc_pctwins) |> select(fc_pctwins, film) |> head(13) |> gt()
fc_pctwins film
24 Bee Movie
24 High School Musical 2
25 The Gingerdead Man
25 How the Grinch Stole Christmas
26 House Party 2
27 Mamma Mia!
28 Get Smart
28 My Big Fat Greek Wedding
29 Loving Adults
29 Madea Goes To Jail
29 Mission Impossible II
29 Night at the Museum: Battle of the Smithsonian
29 A Walk to Remember

Available Predictors for fc_pctwins

str(movies |> select(-fc_pctwins))
tibble [201 × 36] (S3: tbl_df/tbl/data.frame)
 $ film_id         : chr [1:201] "M001" "M002" "M003" "M004" ...
 $ imdb_pct10      : num [1:201] 37.3 27.2 14.9 30.1 22.1 17.1 22.7 22.5 21.2 32.4 ...
 $ rt_audiencescore: int [1:201] 93 92 69 89 84 81 94 95 82 92 ...
 $ ebert           : num [1:201] NA 4 2.5 4 4 2.5 4 4 4 2.5 ...
 $ box_off_mult    : num [1:201] 9.1 NA 1.8 5.5 NA 7.3 9.7 2.9 12.3 7.9 ...
 $ budget          : num [1:201] 6.63 NA 30 12 NA ...
 $ metascore       : int [1:201] 67 93 70 84 87 55 89 87 83 78 ...
 $ bw_rating       : int [1:201] 1 3 3 0 3 3 3 3 3 3 ...
 $ imdb_oscars     : int [1:201] 0 2 0 1 0 0 1 8 3 0 ...
 $ mentions        : int [1:201] 1 1 1 1 1 2 1 1 1 1 ...
 $ dr_love         : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 2 1 2 ...
 $ gen_1           : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 2 2 2 ...
 $ bacon_1         : int [1:201] 3 2 2 2 2 2 2 2 2 2 ...
 $ lang_1          : Factor w/ 14 levels "Arabic","ASL",..: 8 9 5 5 13 5 5 5 5 5 ...
 $ drama           : int [1:201] 1 1 1 0 1 1 0 1 0 0 ...
 $ comedy          : int [1:201] 1 0 1 0 0 1 0 0 0 0 ...
 $ adventure       : int [1:201] 0 0 0 1 0 0 0 0 1 1 ...
 $ action          : int [1:201] 0 0 0 0 0 0 0 0 1 1 ...
 $ romance         : int [1:201] 0 0 1 0 0 0 0 0 0 0 ...
 $ fantasy         : int [1:201] 0 0 0 0 0 1 0 0 1 0 ...
 $ sci_fi          : int [1:201] 0 0 0 1 0 0 1 0 0 1 ...
 $ crime           : int [1:201] 0 0 0 0 0 0 0 0 0 0 ...
 $ thriller        : int [1:201] 0 0 0 0 0 0 0 0 0 0 ...
 $ animation       : int [1:201] 0 0 0 0 0 0 0 0 0 0 ...
 $ family          : int [1:201] 0 0 0 0 0 0 0 0 0 0 ...
 $ mystery         : int [1:201] 0 0 0 0 1 0 0 0 0 0 ...
 $ biography       : int [1:201] 0 0 0 0 0 0 0 1 0 0 ...
 $ music           : int [1:201] 0 0 0 0 0 0 0 1 0 0 ...
 $ horror          : int [1:201] 0 0 0 0 0 0 1 0 0 0 ...
 $ musical         : int [1:201] 0 0 0 0 0 0 0 0 0 0 ...
 $ war             : int [1:201] 0 0 0 0 0 0 0 0 0 0 ...
 $ history         : int [1:201] 0 0 0 0 0 0 0 0 0 0 ...
 $ sport           : int [1:201] 0 0 0 0 0 0 0 0 0 0 ...
 $ western         : int [1:201] 0 0 0 0 0 0 0 0 0 0 ...
 $ film            : chr [1:201] "3 Idiots" "8 1/2" "10 Things I Hate About You" "2001: A Space Odyssey" ...
 $ lang_eng        : num [1:201] 0 0 1 1 0 1 1 1 1 1 ...

First Cut: 18 predictors

Quantities

  • imdb_pct10, rt_audiencescore, box_off_mult, budget, metascore

Counts or Multi-categorical and ordinal

  • ebert, imdb_oscars, mentions, bw_rating, bacon_1

Binary Categorical

  • dr_love, gen_1, lang_eng, and
  • top five genres: drama, comedy, adventure, action, romance

How many predictors can we use?

If we have a linear regression model with 201 observations (at most, some variables are missing, remember), then how many predictors can we realistically fit?

Important

A useful starting strategy when you’re not doing variable selection is that you need at least 15 observations for each coefficient you will estimate, including the intercept.

See https://hbiostat.org/bbr/ Frank Harrell, Biostatistics for Biomedical Research for more on this topic.

How Many Predictors (at maximum)?

Important

A useful starting strategy when you’re not doing variable selection is that you need at least 15 observations for each coefficient you will estimate, including the intercept.

  • The model will run, so long as you have more observations than cases, but that’s not a good standard to use.
  • Bigger samples are better, but sample size is often determined by pragmatic considerations.

A more modern (and complex) answer is found in this Riley et al (2020) BMJ article.

201 / 15 = 13.4 coefficients

Important

A useful starting strategy when you’re not doing variable selection is that you need at least 15 observations for each coefficient you will estimate, including the intercept.

13 is really a maximum. We’d like to avoid fitting more than perhaps 10 coefficients (including the intercept)…

  • Each quantitative predictor requires one coefficient
  • Each binary predictor also requires one coefficient
  • When treated as multi-categorical, a factor with k levels requires k-1 coefficients

Second Cut: 9 predictors

Variable Type Description
imdb_pct10 Quant % of 10-star public ratings in IMDB
rt_audiencescore Quant Rotten Tomatoes Audience Score (% Fresh)
box_off_mult Quant World Wide Gross Revenue (as multiple of budget)
metascore Quant Metascore (0-100 scale) from critic reviews
imdb_oscars Quant # of Oscar (Academy Award) wins
bw_rating Quant Bechdel-Wallace Test Criteria Met (0-3)
lang_eng Binary Is primary language English? (1 = Yes, 0 = No)
drama Binary Is drama listed in imdb_categories? (1 = Yes, 0 = No)
comedy Binary Is comedy listed in imdb_categories? (1 = Yes, 0 = No)
  • 10 coefficients x 15 = 150 observations needed, at minimum. We have 201.

Create movies_2

movies_2 <- movies |>
  select(film_id, fc_pctwins, imdb_pct10, rt_audiencescore,
         box_off_mult, metascore, imdb_oscars, bw_rating,
         lang_eng, drama, comedy, film)

dim(movies_2)
[1] 201  12
  • Two identifiers (film and film_id)
  • Our outcome fc_pctwins
  • Our nine predictors (from previous slide)

How much missingness do we have?

miss_case_table(movies_2) |> gt() 
n_miss_in_case n_cases pct_cases
0 175 87.064677
1 16 7.960199
2 6 2.985075
3 4 1.990050
miss_var_summary(movies_2) |> filter(n_miss > 0) |> gt()
variable n_miss pct_miss
box_off_mult 20 9.9502488
metascore 10 4.9751244
bw_rating 8 3.9800995
imdb_oscars 2 0.9950249

Visualize missing data

plot_pattern(movies_2, rotate = TRUE) ## from ggmice package

Can we assume MCAR and just do a complete case analysis?

mcar_test(movies_2) |> gt() 
statistic df p.value missing.patterns
114.8076 83 0.01192347 9
  • Note that we’re hurt here by reducing our data set to 12 variables, but that’s the way the cookie crumbles.
mcar_test(movies) |> gt()
statistic df p.value missing.patterns
434.3851 414 0.2356999 13

Model with Complete Cases?

mod_cc <- lm(fc_pctwins ~ 
               imdb_pct10 + rt_audiencescore + box_off_mult + 
               metascore + imdb_oscars + bw_rating + lang_eng + 
               drama + comedy, data = movies_2)

glance(mod_cc) |> gt()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.6737 0.6559019 7.334822 37.85219 9.217967e-36 9 -591.8765 1205.753 1240.566 8876.936 165 175

No data-driven variable selection

Why not?

  • We can’t afford it. Too small a ratio of sample size to predictors.

What about validation?

  • We don’t want to split our sample, certainly.
  • There are other methods (coming in 432) which would help, like k-fold cross validation and bootstrap validation.

Can we do multiple imputation from the start?

Let’s create 25 imputations

More on How Many Imputations here

movies_imp<- mice(movies_2, m = 25, seed = 431012, print = FALSE)

summary(movies_imp)
Class: mids
Number of multiple imputations:  25 
Imputation methods:
         film_id       fc_pctwins       imdb_pct10 rt_audiencescore 
              ""               ""               ""               "" 
    box_off_mult        metascore      imdb_oscars        bw_rating 
           "pmm"            "pmm"            "pmm"            "pmm" 
        lang_eng            drama           comedy             film 
              ""               ""               ""               "" 
PredictorMatrix:
                 film_id fc_pctwins imdb_pct10 rt_audiencescore box_off_mult
film_id                0          1          1                1            1
fc_pctwins             0          0          1                1            1
imdb_pct10             0          1          0                1            1
rt_audiencescore       0          1          1                0            1
box_off_mult           0          1          1                1            0
metascore              0          1          1                1            1
                 metascore imdb_oscars bw_rating lang_eng drama comedy film
film_id                  1           1         1        1     1      1    0
fc_pctwins               1           1         1        1     1      1    0
imdb_pct10               1           1         1        1     1      1    0
rt_audiencescore         1           1         1        1     1      1    0
box_off_mult             1           1         1        1     1      1    0
metascore                0           1         1        1     1      1    0
Number of logged events:  2 
  it im dep     meth     out
1  0  0     constant film_id
2  0  0     constant    film

Visualize missing data

ggmice(movies_imp, aes(x = .imp, y = box_off_mult)) +
  geom_jitter(height = 0, width = 0.2) + labs(x = "Imputation #")

Visualize missing data

ggmice(movies_2, aes(x = box_off_mult, y = fc_pctwins)) +
  geom_point(size = 3)

Visualize imputed data

From all 25 imputations at once

ggmice(movies_imp, aes(x = box_off_mult, y = fc_pctwins)) +
  geom_point(size = 3)

Select Imputation 6

imp6 <- complete(movies_imp, 6) |> as_tibble()
summary(imp6)
   film_id            fc_pctwins   imdb_pct10    rt_audiencescore
 Length:201         Min.   :24   Min.   : 3.80   Min.   :28.00   
 Class :character   1st Qu.:42   1st Qu.:11.60   1st Qu.:76.00   
 Mode  :character   Median :52   Median :15.60   Median :86.00   
                    Mean   :51   Mean   :17.53   Mean   :81.91   
                    3rd Qu.:60   3rd Qu.:22.20   3rd Qu.:92.00   
                    Max.   :79   Max.   :55.00   Max.   :98.00   
  box_off_mult       metascore       imdb_oscars        bw_rating    
 Min.   : 0.0013   Min.   :  9.00   Min.   : 0.0000   Min.   :0.000  
 1st Qu.: 2.8000   1st Qu.: 61.00   1st Qu.: 0.0000   1st Qu.:1.000  
 Median : 4.7000   Median : 71.00   Median : 0.0000   Median :3.000  
 Mean   : 8.5436   Mean   : 71.11   Mean   : 0.9751   Mean   :2.124  
 3rd Qu.: 9.6000   3rd Qu.: 83.00   3rd Qu.: 1.0000   3rd Qu.:3.000  
 Max.   :73.7000   Max.   :100.00   Max.   :11.0000   Max.   :3.000  
    lang_eng          drama            comedy           film          
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Length:201        
 1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
 Median :1.0000   Median :1.0000   Median :0.0000   Mode  :character  
 Mean   :0.8806   Mean   :0.5721   Mean   :0.3582                     
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000                     
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000                     

Correlation Matrix (Imputation 6)

imp6_quants <- imp6 |> select(-film_id, -film)
cor_quants6 <- cor(imp6_quants)

round_half_up(cor_quants6, digits = 2)
                 fc_pctwins imdb_pct10 rt_audiencescore box_off_mult metascore
fc_pctwins             1.00       0.62             0.67         0.24      0.65
imdb_pct10             0.62       1.00             0.61         0.14      0.35
rt_audiencescore       0.67       0.61             1.00         0.22      0.52
box_off_mult           0.24       0.14             0.22         1.00      0.19
metascore              0.65       0.35             0.52         0.19      1.00
imdb_oscars            0.33       0.31             0.22         0.19      0.40
bw_rating             -0.11      -0.04            -0.07         0.05     -0.09
lang_eng              -0.07      -0.17            -0.13         0.05     -0.17
drama                  0.10       0.20             0.23         0.04      0.14
comedy                -0.32      -0.29            -0.21        -0.06     -0.24
                 imdb_oscars bw_rating lang_eng drama comedy
fc_pctwins              0.33     -0.11    -0.07  0.10  -0.32
imdb_pct10              0.31     -0.04    -0.17  0.20  -0.29
rt_audiencescore        0.22     -0.07    -0.13  0.23  -0.21
box_off_mult            0.19      0.05     0.05  0.04  -0.06
metascore               0.40     -0.09    -0.17  0.14  -0.24
imdb_oscars             1.00      0.03     0.12  0.13  -0.19
bw_rating               0.03      1.00     0.01 -0.01   0.04
lang_eng                0.12      0.01     1.00 -0.19   0.08
drama                   0.13     -0.01    -0.19  1.00  -0.26
comedy                 -0.19      0.04     0.08 -0.26   1.00

Correlation Matrix Plot (imp 6)

corrplot(cor_quants6)

Collinearity in Imputation 6

i6_mod <- lm(fc_pctwins ~ imdb_pct10 + rt_audiencescore + 
               box_off_mult + metascore + imdb_oscars + 
               bw_rating + lang_eng + drama + comedy,
             data = imp6)

vif(i6_mod) ## from car package
      imdb_pct10 rt_audiencescore     box_off_mult        metascore 
        1.780058         2.005588         1.087071         1.659157 
     imdb_oscars        bw_rating         lang_eng            drama 
        1.381051         1.020767         1.154302         1.153332 
          comedy 
        1.174549 
  • Any apparent problems with collinearity here?

Scatterplot Matrix (Imputation 6)

ggpairs(imp6_quants)

Box-Cox transformation suggestion?

i6_mod <- lm(fc_pctwins ~ imdb_pct10 + rt_audiencescore + 
               box_off_mult + metascore + imdb_oscars + 
               bw_rating + lang_eng + drama + comedy, data = imp6)
boxCox(i6_mod) 

Build our model in Imputation 6

i6_mod <- lm(fc_pctwins ~ imdb_pct10 + rt_audiencescore + 
               box_off_mult + metascore + imdb_oscars + 
               bw_rating + lang_eng + drama + comedy,
             data = imp6)

glance(i6_mod) |> gt() 
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.6640256 0.6481943 7.471463 41.94396 1.115782e-40 9 -684.3072 1390.614 1426.951 10662.15 191 201

Tidied Coefficients (Imp. 6)

tidy(i6_mod, conf.int = TRUE, conf.level = 0.90) |>
  gt() |> fmt_number(-term, decimals = 3) 
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.506 4.557 0.111 0.912 −7.026 8.038
imdb_pct10 0.422 0.077 5.449 0.000 0.294 0.550
rt_audiencescore 0.268 0.056 4.835 0.000 0.177 0.360
box_off_mult 0.062 0.048 1.288 0.199 −0.018 0.142
metascore 0.306 0.043 7.052 0.000 0.234 0.377
imdb_oscars −0.068 0.328 −0.208 0.835 −0.610 0.474
bw_rating −0.483 0.489 −0.987 0.325 −1.292 0.326
lang_eng 2.742 1.746 1.570 0.118 −0.144 5.628
drama −2.503 1.144 −2.188 0.030 −4.393 −0.612
comedy −2.908 1.191 −2.441 0.016 −4.876 −0.939
par(mfrow = c(2,2)); plot(i6_mod); par(mfrow = c(1,1))

Idealized 5-fold validation strategy

10-fold cross-validation?

set.seed(431432)
ctrl <- trainControl(method = "cv", number = 10) ## caret package

## train our model on those 10 folds

i6_train <- train(fc_pctwins ~ imdb_pct10 + rt_audiencescore + 
                   box_off_mult + metascore + imdb_oscars + 
                   bw_rating + lang_eng + drama + comedy,
                 data = imp6, method = "lm", trControl = ctrl)

Summarize 10-fold cross-validation

i6_train
Linear Regression 

201 samples
  9 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 181, 181, 181, 180, 181, 181, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  7.586781  0.6480235  6.089882

Tuning parameter 'intercept' was held constant at a value of TRUE

Fitted Model after 10-fold CV

i6_train$finalModel

Call:
lm(formula = .outcome ~ ., data = dat)

Coefficients:
     (Intercept)        imdb_pct10  rt_audiencescore      box_off_mult  
         0.50592           0.42223           0.26850           0.06218  
       metascore       imdb_oscars         bw_rating          lang_eng  
         0.30556          -0.06831          -0.48314           2.74188  
           drama            comedy  
        -2.50258          -2.90757  
  • Can use tidy(), glance(), etc. on this

Review Summaries within each fold

i6_train$resample
       RMSE  Rsquared      MAE Resample
1  6.892492 0.6514169 5.673471   Fold01
2  8.676444 0.5861332 7.689770   Fold02
3  7.714714 0.6746787 6.290568   Fold03
4  7.269754 0.6314517 5.860194   Fold04
5  7.397851 0.7347794 6.243674   Fold05
6  8.182343 0.7056599 6.423322   Fold06
7  7.193403 0.6661355 5.777834   Fold07
8  8.546438 0.5977726 6.401908   Fold08
9  5.611769 0.7698717 3.971440   Fold09
10 8.382604 0.4623354 6.566637   Fold10

Pooled Coefficient Estimates

movies_mods <- with(movies_imp, 
                    lm(fc_pctwins ~ imdb_pct10 + rt_audiencescore + 
                         box_off_mult + metascore + imdb_oscars + 
                         bw_rating + lang_eng + drama + comedy))

summary(movies_mods)
# A tibble: 250 × 6
   term             estimate std.error statistic  p.value  nobs
   <chr>               <dbl>     <dbl>     <dbl>    <dbl> <int>
 1 (Intercept)        0.0133    4.46     0.00297 9.98e- 1   201
 2 imdb_pct10         0.435     0.0756   5.76    3.37e- 8   201
 3 rt_audiencescore   0.252     0.0541   4.66    6.00e- 6   201
 4 box_off_mult       0.0755    0.0472   1.60    1.11e- 1   201
 5 metascore          0.318     0.0406   7.84    3.08e-13   201
 6 imdb_oscars       -0.159     0.321   -0.495   6.21e- 1   201
 7 bw_rating         -0.404     0.480   -0.841   4.02e- 1   201
 8 lang_eng           3.04      1.71     1.78    7.59e- 2   201
 9 drama             -2.08      1.12    -1.86    6.41e- 2   201
10 comedy            -2.46      1.17    -2.11    3.63e- 2   201
# ℹ 240 more rows

Pooled Estimates: \(R^2\) & adjusted \(R^2\)

movie_pool <- pool(movies_mods)

pool.r.squared(movies_mods)
          est     lo 95     hi 95        fmi
R^2 0.6684598 0.5853228 0.7384219 0.02694699
pool.r.squared(movies_mods, adjusted = TRUE)
              est     lo 95     hi 95        fmi
adj R^2 0.6528365 0.5670716 0.7254529 0.02757476

Or, you can just run glance on the pooled results…

glance(movie_pool)
  nimp nobs r.squared adj.r.squared
1   25  201 0.6684598     0.6528365

Pooled Coefficient Estimates

summary(movie_pool, conf.int = TRUE, conf.level = 0.90) |> gt() |>
  fmt_number(-term, decimals = 3)
term estimate std.error statistic df p.value 5 % 95 %
(Intercept) 0.323 4.649 0.069 174.894 0.945 −7.366 8.011
imdb_pct10 0.420 0.078 5.400 184.589 0.000 0.291 0.548
rt_audiencescore 0.261 0.056 4.636 180.852 0.000 0.168 0.354
box_off_mult 0.066 0.049 1.357 156.498 0.177 −0.015 0.148
metascore 0.310 0.043 7.133 179.271 0.000 0.238 0.382
imdb_oscars −0.103 0.330 −0.311 183.990 0.756 −0.648 0.442
bw_rating −0.367 0.497 −0.738 181.708 0.462 −1.189 0.455
lang_eng 2.995 1.794 1.669 175.651 0.097 0.028 5.961
drama −2.299 1.146 −2.006 186.254 0.046 −4.194 −0.405
comedy −2.815 1.195 −2.356 185.783 0.019 −4.791 −0.840

Coefficient Estimates: Imp. 6

tidy(i6_mod, conf.int = TRUE, conf.level = 0.90) |>
  gt() |> fmt_number(-term, decimals = 3) 
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.506 4.557 0.111 0.912 −7.026 8.038
imdb_pct10 0.422 0.077 5.449 0.000 0.294 0.550
rt_audiencescore 0.268 0.056 4.835 0.000 0.177 0.360
box_off_mult 0.062 0.048 1.288 0.199 −0.018 0.142
metascore 0.306 0.043 7.052 0.000 0.234 0.377
imdb_oscars −0.068 0.328 −0.208 0.835 −0.610 0.474
bw_rating −0.483 0.489 −0.987 0.325 −1.292 0.326
lang_eng 2.742 1.746 1.570 0.118 −0.144 5.628
drama −2.503 1.144 −2.188 0.030 −4.393 −0.612
comedy −2.908 1.191 −2.441 0.016 −4.876 −0.939

Model on Complete Cases Only

tidy(mod_cc, conf.int = TRUE, conf.level = 0.90) |>
  gt() |> fmt_number(-term, decimals = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) −2.206 5.202 −0.424 0.672 −10.812 6.399
imdb_pct10 0.463 0.081 5.678 0.000 0.328 0.598
rt_audiencescore 0.272 0.061 4.448 0.000 0.171 0.373
box_off_mult 0.063 0.048 1.306 0.193 −0.017 0.143
metascore 0.309 0.045 6.798 0.000 0.234 0.384
imdb_oscars −0.301 0.330 −0.912 0.363 −0.848 0.245
bw_rating −0.182 0.521 −0.349 0.727 −1.045 0.680
lang_eng 3.319 2.300 1.443 0.151 −0.486 7.124
drama −1.063 1.191 −0.893 0.373 −3.033 0.906
comedy −2.324 1.255 −1.851 0.066 −4.400 −0.248

Comparing the Results

  • What about Fit Quality?
Method \(R^2\) Adjusted \(R^2\) # of obs.
Multiple Imputation 0.668 0.653 201
Single Imputation 6 0.664 0.648 201
Complete Cases 0.674 0.656 175

10-fold Cross-Validated \(R^2\) for single imputation 6 was 0.648

Comparing the Estimates

Term MI est MI se MI p I6 est I6 se I6 p CC est CC se CC p
Intercept 0.32 4.65 0.95 0.51 4.56 0.91 -2.21 5.20 0.67
imdb_pct10 0.42 0.08 <.01 0.42 0.08 <.01 0.46 0.08 <.01
rt_audience 0.26 0.06 <.01 0.27 0.06 <.01 0.27 0.06 <.01
box_off_mult 0.07 0.05 0.18 0.06 0.05 0.20 0.06 0.05 0.19
metascore 0.31 0.04 <.01 0.31 0.04 <.01 0.31 0.05 <.01
imdb_oscars -0.10 0.33 0.76 -0.07 0.33 0.84 -0.30 0.33 0.36
bw_rating -0.37 0.50 0.46 -0.48 0.49 0.33 -0.18 0.52 0.73
lang_eng 3.00 1.79 0.10 2.74 1.75 0.12 3.32 2.30 0.15
drama -2.30 1.15 0.05 -2.50 1.14 0.03 -1.06 1.19 0.37
comedy -2.82 1.20 0.02 -2.91 1.19 0.02 -2.32 1.26 0.07

Session Information

session_info()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  abind_1.4-5          askpass_1.2.0        backports_1.4.1     
  base64enc_0.1.3      bigD_0.2.0           bit_4.0.5           
  bit64_4.0.5          bitops_1.0.7         blob_1.2.4          
  boot_1.3-28.1        brio_1.1.3           broom_1.0.5         
  broom.helpers_1.14.0 bslib_0.6.1          cachem_1.0.8        
  callr_3.7.3          car_3.1-2            carData_3.0-5       
  caret_6.0-94         cellranger_1.1.0     class_7.3-22        
  cli_3.6.1            clipr_0.8.0          clock_0.7.0         
  codetools_0.2-19     colorspace_2.1-0     commonmark_1.9.0    
  compiler_4.3.2       conflicted_1.2.0     corrplot_0.92       
  cpp11_0.4.7          crayon_1.5.2         curl_5.1.0          
  data.table_1.14.8    DBI_1.1.3            dbplyr_2.4.0        
  desc_1.4.2           diagram_1.6.5        diffobj_0.3.5       
  digest_0.6.33        dplyr_1.1.4          dtplyr_1.3.1        
  e1071_1.7.13         ellipsis_0.3.2       evaluate_0.23       
  fansi_1.0.5          farver_2.1.1         fastmap_1.1.1       
  fontawesome_0.5.2    forcats_1.0.0        foreach_1.5.2       
  fs_1.6.3             future_1.33.0        future.apply_1.11.0 
  gargle_1.5.2         generics_0.1.3       GGally_2.2.0        
  ggformula_0.12.0     ggmice_0.1.0         ggplot2_3.4.4       
  ggridges_0.5.4       ggstats_0.5.1        glmnet_4.1-8        
  globals_0.16.2       glue_1.6.2           googledrive_2.1.1   
  googlesheets4_1.1.1  gower_1.0.1          graphics_4.3.2      
  grDevices_4.3.2      grid_4.3.2           gridExtra_2.3       
  gt_0.10.0            gtable_0.3.4         hardhat_1.3.0       
  haven_2.5.4          here_1.0.1           highr_0.10          
  hms_1.1.3            htmltools_0.5.7      htmlwidgets_1.6.3   
  httr_1.4.7           ids_1.0.1            ipred_0.9-14        
  isoband_0.2.7        iterators_1.0.14     janitor_2.2.0       
  jomo_2.7-6           jquerylib_0.1.4      jsonlite_1.8.7      
  juicyjuice_0.1.0     KernSmooth_2.23.22   knitr_1.45          
  labeling_0.4.3       labelled_2.12.0      lattice_0.21-9      
  lava_1.7.3           lifecycle_1.0.4      listenv_0.9.0       
  lme4_1.1-35.1        lubridate_1.9.3      magrittr_2.0.3      
  markdown_1.11        MASS_7.3-60          Matrix_1.6-4        
  MatrixModels_0.5.3   memoise_2.0.1        methods_4.3.2       
  mgcv_1.9.0           mice_3.16.0          mime_0.12           
  minqa_1.2.6          mitml_0.4-5          ModelMetrics_1.2.2.2
  modelr_0.1.11        mosaic_1.9.0         mosaicCore_0.9.4.0  
  mosaicData_0.20.4    munsell_0.5.0        naniar_1.0.0        
  nlme_3.1-163         nloptr_2.0.3         nnet_7.3-19         
  norm_1.0-11.1        numDeriv_2016.8.1.1  openssl_2.1.1       
  ordinal_2022.11.16   pan_1.9              parallel_4.3.2      
  parallelly_1.36.0    patchwork_1.1.3      pbkrtest_0.5.2      
  pillar_1.9.0         pkgbuild_1.4.2       pkgconfig_2.0.3     
  pkgload_1.3.3        plyr_1.8.9           praise_1.0.0        
  prettyunits_1.2.0    pROC_1.18.5          processx_3.8.2      
  prodlim_2023.08.28   progress_1.2.2       progressr_0.14.0    
  proxy_0.4.27         ps_1.7.5             purrr_1.0.2         
  quantreg_5.97        R6_2.5.1             ragg_1.2.6          
  rappdirs_0.3.3       RColorBrewer_1.1-3   Rcpp_1.0.11         
  RcppEigen_0.3.3.9.4  reactable_0.4.4      reactR_0.5.0        
  readr_2.1.4          readxl_1.4.3         recipes_1.0.8       
  rematch_2.0.0        rematch2_2.1.2       reprex_2.0.2        
  reshape2_1.4.4       rlang_1.1.2          rmarkdown_2.25      
  rpart_4.1.21         rprojroot_2.0.4      rstudioapi_0.15.0   
  rvest_1.0.3          sass_0.4.7           scales_1.3.0        
  selectr_0.4.2        shape_1.4.6          snakecase_0.11.1    
  SparseM_1.81         splines_4.3.2        SQUAREM_2021.1      
  stats_4.3.2          stats4_4.3.2         stringi_1.8.2       
  stringr_1.5.1        survival_3.5-7       sys_3.4.2           
  systemfonts_1.0.5    testthat_3.2.1       textshaping_0.3.7   
  tibble_3.2.1         tidyr_1.3.0          tidyselect_1.2.0    
  tidyverse_2.0.0      timechange_0.2.0     timeDate_4022.108   
  tinytex_0.49         tools_4.3.2          tzdb_0.4.0          
  ucminf_1.2.0         UpSetR_1.4.0         utf8_1.2.4          
  utils_4.3.2          uuid_1.1.1           V8_4.4.0            
  vctrs_0.6.5          viridis_0.6.4        viridisLite_0.4.2   
  visdat_0.6.0         vroom_1.6.4          waldo_0.5.2         
  withr_2.5.2          xfun_0.41            xml2_1.3.5          
  yaml_2.3.7